ABSTRACT

In this paper we describe several attempts to model speech perception in terms
of a processing system in which knowledge and processing are distributed over large
numbers of highly interactive, but computationally primitive, elements, all working
in parallel to jointly determine the result of the perceptual process. We begin by
discussing the properties of speech which we feel demand a parallel interactive
processing system, and then review previous attempts to model speech perception,
both psycholinguistic and machine-based. We then present the results of a computer
simulation of one version of an interactive activation model of speech, based loosely
on Marslen-Wilson's COHORT model. One virtue of the model is that it is capable of
word recognition and phonemic restoration without depending on preliminary
segmentation of the input into phonemes. However, this version of the model has several
deficiencies, among them excessive sensitivity to speech rate and excessive
dependence on accurate information about the beginnings of words. To address
some of these deficiencies, we describe an alternative called the TRACE model. In
this version of the model, interactive activation processes take place within a
structure which serves as a dynamic working memory. This structure permits the model to
capture contextual influences in which the perception of a portion of the input
stream is influenced by what follows it as well as what precedes it in the speech
signal.

INTRODUCTION: Interactive Activation Models

Researchers who have attempted to understand higher-level mental
processes have often assumed that an appropriate analogy to the organization of
these processes in the human mind was the high-speed digital computer. However,
it is a striking fact that computers are virtually incapable of handling the
routine mental feats of perception, language comprehension, and memory retrieval
which we as humans take so much for granted. This difficulty is especially
apparent in the case of machine-based speech recognition systems.

Recently a new way of thinking about the kind of processing system in which
these processes take place has begun to attract the attention of a number of
investigators. Instead of thinking of the cognitive system as a single high-speed
processor capable of arbitrarily complex sequences of operations, scientists in
many branches of cognitive science are beginning to think in terms of alternative
approaches. Although the details vary from model to model, these models usually
assume that information processing takes place in a system containing very large
numbers of highly interconnected units, each of about the order of complexity of
a neuron. That is, each unit accumulates excitatory and inhibitory inputs from
other units and sends such signals to others on the basis of a fairly simple
(though usually non-linear) function of its inputs, and adjusts its interconnections
with other units to be more or less responsive to particular inputs in the future.
Such models may be called interactive activation models because processing
takes place in them through the interaction of large numbers of units of varying
degrees of activation. In such a system, a representation is a pattern of activity
distributed over the units in the system and the pattern of strengths of the
interconnections between the units. Processing amounts to the unfolding of such a
representation in time through excitatory and inhibitory interactions and changes
in the strengths of the interconnections. The interactive activation model of
reading (McClelland and Rumelhart, 1981; Rumelhart and McClelland, 1982) is one
example of this approach; a thorough survey of recent developments in this field
is available in Hinton and Anderson (1981).
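
The behavior of a single unit described above can be sketched in code. The
following is a minimal illustration, not the simulation reported in this chapter.
The update rule follows the general form used in the interactive activation model
of reading (McClelland and Rumelhart, 1981); the function names, the parameter
values, and the two-unit network are assumptions chosen for the example, and the
adjustment of interconnection strengths is omitted.

```python
# Illustrative sketch only: names and parameter values are assumptions,
# not taken from the simulations described in this chapter.

def update_activation(a, net, a_max=1.0, a_min=-0.2, rest=0.0, decay=0.1):
    """Return the next activation of one unit given its net input.

    Positive net input drives the unit toward a_max, negative net input
    drives it toward a_min, and decay pulls it back toward its resting
    level. The effect of the input shrinks as the unit nears its limits,
    which makes the function non-linear in the unit's own state.
    """
    if net > 0:
        delta = net * (a_max - a)
    else:
        delta = net * (a - a_min)
    new_a = a + delta - decay * (a - rest)
    # Keep the activation inside its allowed range.
    return max(a_min, min(a_max, new_a))


def step(acts, weights):
    """Advance every unit by one time step.

    weights[j][i] is the connection strength from unit j to unit i
    (positive = excitatory, negative = inhibitory). Only units whose
    activation is above zero send signals to other units.
    """
    n = len(acts)
    nets = [
        sum(weights[j][i] * max(acts[j], 0.0) for j in range(n) if j != i)
        for i in range(n)
    ]
    return [update_activation(a, net) for a, net in zip(acts, nets)]


# Two mutually inhibitory units: the more active unit suppresses the other.
weights = [[0.0, -0.5],
           [-0.5, 0.0]]
acts = step([0.6, 0.2], weights)
```

With these assumed values, one step leaves the initially stronger unit well ahead
of its competitor, which is the kind of competition through mutual inhibition
that such models rely on.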

In this chapter we will describe research currently in progress in our
laboratory at the University of California, San Diego. The goal of this work is to model
speech perception as an interactive activation process. Research over the past
several decades has made it abundantly clear that the speech signal is extremely
complex and rich in detail. It is also clear from perceptual studies that human
listeners appear able to deal with this complexity and to attend to the detail in
ways which are difficult to account for using traditional approaches. It is our
belief that interactive activation models may provide exactly the sort of
computational framework which is needed to perceive speech. While we make no claims
about the neural basis for our model, we do feel that the model is far more
consistent with what is known about the functional neurophysiology of the human
brain than is the von Neumann machine.

The chapter is organized in the following manner. We begin by reviewing
relevant facts about speech acoustics and speech perception. Our purpose is to
demonstrate the nature of the problem. We then consider several previous
attempts to model the perception of speech, and argue that these attempts,
when they are considered in any detail, fail to account for the observed
phenomena. Next we turn to our modeling efforts. We describe an early version of
the model, and present the results of several studies involving a computer
simulation of the model. Then, we consider shortcomings of this version of the model.
Finally, we describe an alternative formulation which is currently being developed.

THE PROBLEM OF SPEECH PERCEPTION

There has been a great deal of research on the perception of speech over
the past several decades. This research has succeeded in demonstrating the
magnitude of the problem facing any attempt to model the process by which
humans perceive speech. At the same time, important clues about the nature of
the process have been revealed. In this section we review these two aspects of
what has been learned about the problem.


Why Speech Perception is Difficult

The segmentation problem. There has been considerable debate about
what the 'units' of speech perception are. Various researchers have advanced
arguments in favor of diphones (Klatt, 1980), phonemes (Pisoni, 1981),
demisyllables (Fujimura & Lovins, 1978), context-sensitive allophones (Wickelgren,
1969), and syllables (Studdert-Kennedy, 1976), among others, as basic units in
perception. Regardless of which of these proposals one favors, it nonetheless
seems clear that at various levels of processing there exist some kind(s) of unit
which have been extracted from the speech signal. (This conclusion appears
necessary if one assumes a generative capacity in speech perception.) It is
therefore usually assumed that an important and appropriate task for speech
analysis is somehow to segment the speech input: to draw lines separating the
units.

The problem is that whatever the units of perception are, their boundaries
are rarely evident in the signal (Zue & Schwartz, 1980). The information which
specifies a particular phoneme is "encoded" in a stretch of speech much larger
than that which we would normally say actually represents the phoneme
(Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967). It may be impossible to
say where one phoneme (or demisyllable, or word, etc.) ends and the next begins.

As a consequence, most systems begin to process an utterance by attempting
what is usually an extremely errorful task. These errors give rise to further
errors at later stages. A number of strategies have evolved with the sole purpose
of recovering from initial mistakes in segmentation (e.g., the "segment lattice"
approach adopted by BBN's HWIM system; Bolt, Beranek, & Newman, 1976).

We also feel that there are units of speech perception. However, it is our
belief that an adequate model of speech perception will be able to accomplish the
apparently paradoxical task of retrieving these units without ever explicitly
segmenting the input.


Coarticulatory effects. The production of a given sound is greatly affected
by the sounds which surround it. This phenomenon is termed coarticulation. As an
example, consider the manner in which the velar stop [g] is produced in the words
gap vs. geese. In the latter word, the place of oral closure is moved forward
along the velum in anticipation of the front vowel [i]. Similar effects have been
noted for anticipatory rounding (compare the [s] in stew with the [s] in steal), for
nasalization (e.g., the [a] in can't vs. cat), and for velarization (e.g., the [n] in
tank vs. tenth), to name but a few. Coarticulation can also result in the addition of
sounds (consider the intrusive [t] in the pronunciation of tense as [tents]).

We have already noted how coarticulation may make it difficult to locate
boundaries between segments. Another problem arises as well. This high degree
of context-dependence renders the acoustic correlates of speech sounds highly
variable. Remarkably, listeners rarely misperceive speech in the way we might
expect from this variability. Instead they seem able to adjust their perceptions to
compensate for context. Thus, researchers have routinely found that listeners
compensate for coarticulatory effects. A few examples of this phenomenon
follow:


* There is a tendency in the production of vowels for speakers to "undershoot"
the target formant frequencies for the vowel (Lindblom, 1963). Thus, the
possibility arises that the same formant pattern may signal one vowel in the context of
a bilabial consonant and another vowel in the context of a palatal. Listeners
have been found to adjust their perceptions accordingly, such that their
perception correlates with an extrapolated formant target rather than the formant
values actually attained (Lindblom & Studdert-Kennedy, 1967). Oddly, it has been
reported that vowels in such contexts are perceived even more accurately than
vowels in isolation (Strange, Verbrugge, & Shankweiler, 1976; Verbrugge,
Shankweiler, & Fowler, 1976).

* The distinction between [s] and [ʃ] is based in part on the frequency spectrum
of the frication (Harris, 1958; Strevens, 1960), such that when energy is
concentrated in regions above 4 kHz an [s] is heard. When there is considerable
energy below this boundary, an [ʃ] is heard. However, it is possible for the
spectra of both these fricatives to be lowered due to coarticulation with a following
rounded vowel. When this occurs, the perceptual boundary appears to shift.
Thus, the same spectrum will be perceived as an [s] in one case, and as an [ʃ] in
the other, depending on which vowel follows (Mann & Repp, 1980). A preceding
vowel has a similar though smaller effect (Hasegawa, 1976).

* Ohman (1966) has demonstrated instances of vowel coarticulation across a
consonant (that is, cases where the formant trajectories of the first vowel in a VCV
sequence are affected by the non-adjacent second vowel, despite the
intervention of a consonant). In a series of experiments in which such stimuli were
cross-spliced, Martin and Bunnell (1981) were able to show that listeners are
sensitive to such distal coarticulatory effects.

* Repp and Mann (1981a, 1981b) have reported generally higher F3 and F4 onset
frequencies for stops following [s] as compared with stops which follow [ʃ].
Parallel perceptual studies revealed that listeners' perceptions varied in a way
which was consistent with such coarticulatory influences.

* The identical burst of noise can cue perception of stops at different places of
articulation. A noise burst centered at 1440 Hz followed by steady-state
formants appropriate to the vowels [i], [a], or [u] will be perceived as [p], [k], or
[p], respectively (Liberman, Delattre, & Cooper, 1952). Presumably this reflects
the manner in which the vocal tract resonances which give rise to the stop burst
are affected during production by the following vowel (Zue, 1976).

* The formant transitions of stop consonants vary with preceding liquids ([r] and
[l]) in a way which is compensated for by listeners' perceptions (Mann, 1980).
Given a sound which is intermediate between [g] and [d], listeners are more likely
to report hearing a [g] when it is preceded by [l] than by [r].


In the above examples, it is hard to be sure what the nature of the relation
is between production and perception. Are listeners accommodating their
perception to production dependencies? Or do speakers modify production to take into
account peculiarities of the perceptual system? Whatever the answer, both the
production and the perception of speech involve complex interactions, and these
interactions tend to be mirrored in the other modality.

Feature dependencies. We have just seen that the manner in which a
feature or segment is interpreted frequently depends on the sounds which
surround it; this is what Jakobson (1968) would have called a syntagmatic relation.
Another factor which must be taken into consideration in analyzing features is
what other features co-occur in the same segment. Features may be realized in
different ways, depending on what other features are present.

If a speaker is asked to produce two vowels with equal duration, amplitude,
and fundamental frequency (F0), and one has a low tongue position (such as [a])
and the other has a high tongue position (e.g., [i]), the [a] will generally be longer,
louder, and have a lower F0 than the [i] (Peterson & Barney, 1952). This
production dependency is mirrored by listeners' perceptual behavior. Despite physical
differences in duration, amplitude, and F0, the vowels produced in the above
manner are perceived as identical with regard to these dimensions (Chuang &
Wang, 1978). Another example of such an effect may be found in the
relationship between the place of articulation and voicing of a stop. The perceptual
threshold for voicing shifts along the VOT continuum as a function of place,
mirroring a change which occurs in production.

In both these examples, the interaction is between feature and
intra-segmental context, rather than between feature and trans-segmental context.


Trading relations. A single articulatory event may give rise to multiple
acoustic cues. This is the case with voicing in initial stops. In articulatory terms,
voicing is indicated by the magnitude of voice onset time (VOT). VOT refers to the
temporal offset between onset of glottal pulsing and the release of the stop. This
apparently simple event has complex acoustic consequences. Among other cues, the
following provide evidence for the VOT: (1) presence or absence of first formant
(F1 cutback), (2) voiced transition duration, (3) onset frequency of F1,
(4) amplitude of burst, and (5) F0 onset contour. Lisker (1957, 1978) has provided an
even more extensive catalogue of cues which are available for determining the
voicing of stops in intervocalic position.

In cases such as the above, where multiple cues are associated with a
phonetic distinction, these cues exhibit what have been called "trading relations"
(see Repp, 1981, for review). Presence of one of the cues in greater strength
may compensate for absence or weakness of another cue. Such perceptual
dependencies have been noted for the cues which signal place and manner of
articulation in stops (Miller & Eimas, 1977; Oden & Massaro, 1978; Massaro &
Oden, 1980a,b; Alfonso, 1981), voicing in fricatives (Derr & Massaro, 1980;
Massaro & Cohen, 1976), and the fricative/affricate distinction (Repp, Liberman,
Eccardt, & Pesetsky, 1978), among many others.

As is the case with contextually governed dependencies, the net effect of
trading relations is that the value of a given cue cannot be known absolutely.
The listener must integrate across all the cues which are available to signal a
phonetic distinction; the significance of any given cue interacts with the other
cues which are present.
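
One way to picture this kind of integration is as a weighted combination of cue
evidence. The sketch below is purely illustrative and is not a model proposed in
any of the work cited above; the cue names, the weights, the evidence values, and
the logistic combination rule are all assumptions made for the example.

```python
import math

# Illustrative sketch only: cue names, weights, and the logistic rule
# are assumptions, not a model from the studies cited in the text.

def prob_voiceless(cues, weights, bias=0.0):
    """Combine signed cue evidence into a probability of a voiceless percept.

    Each cue value is positive when it favors "voiceless" and negative
    when it favors "voiced". Because the cues are summed before the
    decision, extra strength in one cue can offset weakness in another.
    """
    total = bias + sum(weights[name] * value for name, value in cues.items())
    return 1.0 / (1.0 + math.exp(-total))


weights = {"vot": 1.0, "f1_onset": 1.0}

# A strong VOT cue paired with a weak (even contrary) F1 onset cue ...
strong_vot = {"vot": 1.5, "f1_onset": -0.5}
# ... trades against a weak VOT cue paired with a strong F1 onset cue.
strong_f1 = {"vot": -0.5, "f1_onset": 1.5}

p_a = prob_voiceless(strong_vot, weights)
p_b = prob_voiceless(strong_f1, weights)
```

Under these assumptions the two stimuli yield the same response probability even
though no single cue has the same value in both, which is the sense in which the
value of a cue cannot be interpreted in isolation.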

Rate dependencies. The rate of speech normally may vary over the duration
of a single utterance, as well as across utterances. The changes in rate affect
the dynamics of the speech signal in a complex manner. In general, speech is
compressed at higher rates of speech, but some segments (vowels, for example)
are compressed relatively more than others (stops). Furthermore, the boundaries
between phonetic distinctions may change as a function of rate (see Miller, 1981,
for an excellent review of this literature).

One of the cues which distinguishes the stop in [ba] from the glide in [wa] is
the duration of the consonantal transition. At a medium rate of speech a transition
of less than approximately 50 ms causes listeners to perceive stops (Liberman,
Delattre, Gerstman, & Cooper, 1956). Longer durations signal glides (but at very
long durations the transitions indicate a vowel). The location of this boundary is
affected by rate changes; it shifts to shorter values at faster rates (Minifie, Kuhl,
& Stecher, 1978; Miller & Liberman, 1979).
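
The rate-dependent boundary can be illustrated with a small sketch. The 50 ms
value at a medium rate comes from the text above; the idea that the boundary
scales inversely with a single rate factor is an assumption made only to keep the
example simple, as are the function and parameter names. The vowel percept at
very long transition durations is not modeled.

```python
# Illustrative sketch only: the inverse scaling of the boundary with
# rate_factor is an assumption, not a finding from the studies cited.

def classify_transition(transition_ms, rate_factor=1.0, base_boundary_ms=50.0):
    """Label a consonantal transition as cueing a stop or a glide.

    rate_factor is 1.0 at a medium speaking rate and larger for faster
    speech, so the stop/glide boundary moves to shorter durations as
    the rate increases.
    """
    if rate_factor <= 0:
        raise ValueError("rate_factor must be positive")
    boundary_ms = base_boundary_ms / rate_factor
    return "stop" if transition_ms < boundary_ms else "glide"


medium = classify_transition(40.0)                 # boundary at 50 ms
fast = classify_transition(40.0, rate_factor=1.5)  # boundary near 33 ms
```

The same 40 ms transition falls on the stop side of the boundary at the medium
rate and on the glide side at the faster rate, so a fixed duration threshold would
misclassify it in one of the two cases.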

A large number of other important distinctions are affected by the rate of
speech. These include voicing (Summerfield, 1974), vowel quality (Lindblom &
Studdert-Kennedy, 1967; Verbrugge & Shankweiler, 1977), and fricative vs. affricate
(although these findings are somewhat paradoxical; Dorman, Raphael, & Liberman,
1970).


Phonological effects. In addition to the above sources of variability in the
speech signal, consider the following phenomena.


In English, voiceless stop consonants are produced with aspiration in
syllable-initial position (as in [pʰ]) but not when they follow an [s] (as in [sp]). In
many environments, a sequence of an alveolar stop followed by a palatal glide is
replaced by an alveolar palatal affricate, so that did you is pronounced as [dɪdʒu].
Also, in many dialects of American (but not British) English, voiceless alveolar
stops are 'flapped' intervocalically following a stressed vowel (pretty being
pronounced as [prɪɾi]). Some phonological processes may delete segments or even
entire syllables; vowels in unstressed syllables may thus be either reduced or
deleted altogether, as in policeman [plismən].

The above examples illustrate phonological processes. These operate when
certain sounds appear in specific environments. In many respects, they are like
the contextually governed and coarticulatory effects described above (and at
times the distinction is in fact not clear). Phonological changes are relatively
high-level. That is, they are often (although not always) under speaker control.
The pronunciation of pretty as [prɪɾi] is typical of rapid conversational speech,
but if a speaker is asked to pronounce the word very slowly, emphasizing the
separate syllables, he or she will say [prɪ-tʰi]. Many times these processes are
entirely optional; this is generally the case with deletion rules. Other phonological
rules (e.g., allophonic rules) are usually obligatory. This is true of syllable-initial
voiceless stop aspiration.

Phonological rules vary across languages and even across dialects and
speech styles of the same language. They represent an important source of
knowledge listeners have about their language. It is clear that the successful
perception of speech relies heavily on phonological knowledge.


* * * * *


These are but a few of the difficulties which are presented to speech
perceivers. It should be evident that the task of the listener is far from trivial. There
are several points which are worth making explicit before proceeding.

First, the observations above lead us to the following generalization. There
are an extremely large number of factors which converge during the production of
speech. These factors interact in complex ways. Any given sound can be
considered to lie at the nexus of these factors, and to reflect their interaction. The
process of perception must somehow be adapted to unraveling these interactions.

Second, as variable as the speech signal is, that variability is lawful. Some
models of speech perception and most speech recognition systems tend to view
the speech signal as a highly degraded input with a low signal/noise ratio. This is
an unfortunate conclusion. The variability is more properly regarded as the result
of the parallel transmission of information. This parallel transmission provides a
high degree of redundancy. The signal is accordingly complex, but, if it is
analyzed correctly, it is also extremely robust. This leads to the next
conclusion.


Third, rather than searching for acoustic invariance (either through
reanalysis of the signal or proliferation of context-sensitive units) we might do
better to look for ways in which to take advantage of the rule-governed
variability. We maintain that the difficulty which speech perception presents is not how
to reconstruct an impoverished signal; it is how to cope with the tremendous
amount of information which is available, but which is (to use the term proposed
by Liberman et al., 1967) highly encoded. The problem is lack of a suitable
computational framework.


Clues  About  the  Nature  of  the  Process 

The facts reviewed above provide important constraints on models of speech
perception. That is, any successful model will need to account for those
phenomena in an explicit way. In addition, the following facts should be
accounted for in any model of speech perception.

High-level knowledge interacts with low-level decisions. Decisions about
the acoustic/phonetic identity of segments are usually considered to be
low-level. Decisions about questions such as "What word am I hearing?" or "What
clause does this word belong to?" or "What are the pragmatic properties of this
utterance?" are thought of as high-level. In many other models of speech
perception, these decisions are made at separate stages in the process, and these
stages interact minimally and often only indirectly; at best, the interactions are
bottom-up. Acoustic/phonetic decisions may supply information for determining
word identity, but word identification has little to do with acoustic/phonetic
processing.

We know now, however, that speech perception involves extensive
interactions between levels of processing, and that top-down effects are as significant
as bottom-up effects.

For instance, Ganong (1980) has demonstrated that the lexical identity of a
stimulus can affect the decision about whether a stop consonant is voiced or
voiceless. Ganong found that, given a continuum of stimuli which ranged
perceptually from gift to kift, the voiced/voiceless boundary of his subjects was
displaced toward the voiced end, compared with similar decisions involving stimuli
along a giss - kiss continuum. The low-level decision regarding voicing thus
interacted with the high-level lexical decision.

In a similar vein, Isenberg, Walker, Ryder, & Schweickert (1980) found that
the perception of a consonant as being a stop or a fricative interacted with
pragmatic aspects of the sentence in which it occurred. In one of the experiments
reported by Isenberg et al., subjects heard two sentence frames: I like ___ joke
and I like ___ drive. The target slot contained a stimulus which was drawn from a
'to' - 'the' continuum (actually realized as [tə] - [ðə], with successive attenuation
of the amplitude of the burst + aspiration interval cueing the stop/fricative
distinction). For both frames, 'to' as well as 'the' results in a grammatical sentence.
However, joke is more often used as a noun, whereas drive occurs more often as a
verb. Listeners tended to hear the consonant in the way which favored the
pragmatically plausible interpretation of the utterance. This was reflected as a shift
in the phoneme boundary toward the [t] end of the continuum for the I like ___
joke items, and toward the [ð] end for the I like ___ drive items.

The role of phonological knowledge in perception has been illustrated in an
experiment by Massaro and Cohen (1980). Listeners were asked to identify
sounds from a [li]-[ri] continuum (where stimuli differed as to the onset frequency
of F3). The syllables were placed after each of four different consonants; some
of the resulting sequences were phonotactically permissible in English but others
were not. Massaro and Cohen found that the boundary between [l] and [r] varied
as a function of the preceding consonant. Listeners tended to perceive [l], for
example, when it was preceded by an [s], since [#sl] is a legal sequence in
English but [#sr] is not. On the other hand, [r] was favored over [l] when it
followed [t], since English permits [#tr] but not [#tl].

Syntactic decisions also interact with acoustic/phonetic processes. Cooper
and his colleagues (Cooper, 1980; Cooper, Paccia, & Lapointe, 1978; Cooper &
Paccia-Cooper, 1980) have reported a number of instances in which rather subtle
aspects of the speech signal appear to be affected by syntactic properties of
the utterance. These include adjustments in the fundamental frequency, duration,
and the blocking of phonological rules across certain syntactic boundaries. While
these studies are concerned primarily with aspects of production, we might
surmise from previous cases where perception mirrors production that listeners take
advantage of such cues in perceiving speech.

Not only the accuracy, but also the speed of making low-level decisions
about speech is influenced by higher-level factors. Experimental support for this
view is provided by data reported by Marslen-Wilson and Welsh (1978). In their
study subjects were asked to shadow various types of sentences. Some of the
utterances consisted of syntactically and semantically well-formed sentences.
Other utterances were syntactically correct but semantically anomalous. A third
class of utterances was both syntactically and semantically ungrammatical.
Marslen-Wilson and Welsh found that shadowing latencies varied with the type of
utterance. Subjects shadowed the syntactically and semantically well-formed
prose most quickly. Syntactically correct but meaningless utterances were
shadowed less well. Random sequences of words were shadowed most poorly of all.
These results indicate that even when acoustic/phonetic analysis is possible in
the absence of higher-level information, this analysis, at least as required for
purposes of shadowing, seems to be aided by syntactic and semantic support.

A final example of how high-level knowledge interacts with low-level
decisions comes from a study by Elman, Diehl, & Buchwald (1977). This study
illustrates how phonetic categorization depends on language context ("What
language am I listening to?"). Elman et al. constructed stimulus tapes which
contained a number of naturally produced one-syllable items which followed a
precursor sentence. Among the items were the nonsense syllables [ba] or [pa], chosen
so that several syllables had stop VOT values ranging from 0 ms to 40 ms (in
addition to others with more extreme values).

Two tapes were prepared and presented to subjects who were bilingual in
Spanish and English. On one of the tapes, the precursor sentence was "Write the
word..."; the other tape contained the Spanish translation of the same
sentence. Both tapes contained the same [ba] and [pa] nonsense stimuli. Subjects
listened to both tapes; the Spanish tape was heard in a context in which all
experimental materials and instructions were in Spanish, and the English tape was
heard in an English context.


The result was that subjects' perceptions of the same [ba]/[pa] stimuli
varied as a function of context. In the Spanish condition, the phoneme boundary
was located in a region appropriate to Spanish (i.e., near 0 ms.) while in the
English condition the boundary was correct for English (near 30 ms.).
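
The boundary shift can be made concrete with a small sketch. It is only an
illustration: the function name, the hard threshold, and the 15 ms. test
stimulus are assumptions introduced here, and the two boundary values are the
approximate figures quoted above rather than fitted data.

```python
# Approximate category boundaries (ms of voice onset time) quoted in the text.
# Treating them as sharp thresholds is a simplifying assumption.
VOT_BOUNDARY_MS = {"spanish": 0.0, "english": 30.0}

def categorize_stop(vot_ms: float, language_context: str) -> str:
    """Label a bilabial stop as voiced "ba" or voiceless "pa" (hypothetical helper)."""
    boundary = VOT_BOUNDARY_MS[language_context]
    return "pa" if vot_ms > boundary else "ba"

# The same acoustic token, heard in two language contexts.
stimulus_vot_ms = 15.0
labels = {lang: categorize_stop(stimulus_vot_ms, lang) for lang in VOT_BOUNDARY_MS}
```

Under these assumptions a single stimulus with a 15 ms. VOT falls on the [pa]
side of the Spanish boundary and on the [ba] side of the English one, which is
the pattern of context dependence the experiment reports.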

One of the useful lessons of this experiment comes from a comparison of the
results with previous attempts to induce perceptual shifts in bilinguals. Earlier
studies had failed to obtain such language-dependent shifts in phoneme boundary
(even though bilinguals have been found to exhibit such shifts in production).
Elman et al. suggested that the previous failures were due to inadequate
procedures for establishing language context. These included a mismatch between
context (natural speech) and experimental stimuli (synthetic speech). Contextual
variables may be potent forces in perception, but the conditions under which the
interactions occur may also be very precisely and narrowly defined.

Reliance on lexical constraints. Even in the absence of syntactic or semantic
structure, lexical constraints exert a powerful influence on perception; words
are more perceptible than nonwords (Rubin, Turvey, & Van Gelder, 1976). Indeed,
this word advantage is so strong that listeners may even perceive missing
phonemes as present, provided the result yields a real word (Warren, 1970;
Samuel, 1979). Samuel (1980) has shown that if a missing phoneme could be
restored in several ways (e.g., le_ion could be restored either as legion or
lesion), then restoration does not occur.

Speech perception occurs rapidly and in one pass. In our view, an extremely
important fact about human speech perception is that it occurs in one pass and in
real time. Marslen-Wilson (1975) has shown that speakers are able to shadow
(repeat) prose at very short latencies (e.g., 250 ms., roughly equal to a one
syllable delay). In many cases, listeners are able to recognize and begin
producing a word before it has been completed. This is especially true once a
portion of a word has been heard which is sufficient to uniquely determine the
identity of the word. This ability of humans to process in real time stands in
stark contrast to machine-based recognition systems.

Context effects get stronger toward the ends of words. Word endings appear
to be more susceptible to top-down effects than word beginnings. Put differently,
listeners appear to rely on the acoustic input less and less as more of a word is
heard.

Marslen-Wilson and Welsh (1978) found that when subjects were asked to
shadow prose in which errors occurred at various locations in words, the subjects
tended to restore (i.e., correct) the error more often when the error occurred in
the third syllable of a word (63%) than in the first syllable (45%). Cole, Jakimik,
& Cooper (1978) have reported similar findings. On the other hand, if the task is
changed to error detection, as in a study by Cole and Jakimik (1978), and we
measure reaction time, we find that subjects detect errors faster in final
syllables than in initial syllables.

Both sets of results are compatible with the assumption that word perception
involves a narrowing of possible candidates. As the beginning of a word is
heard, there may be many possibilities as to what could follow. Lack of a lexical
bias would lead subjects to repeat what they hear exactly. They would also be
slower in detecting errors, since they would not yet know what word was
intended. As more of the word is heard, the candidates for word recognition are
narrowed. In many cases, a single possibility will emerge before the end of the
word has been presented. This knowledge interacts with the perceptual process
so that less bottom-up information is required to confirm that the expected word
was heard. In some cases, even errors may be missed. At the same time, when
errors are detected, detection latency will be relatively fast. This is because the
listener now knows what the intended word was.

PREVIOUS MODELS OF SPEECH PERCEPTION


One can distinguish two general classes of models of speech perception
which have been proposed. On the one hand we find models which claim to have
some psycholinguistic validity, but which are rarely specified in detail. On the
other hand are machine-based speech understanding systems; these are
necessarily more explicit but do not usually claim to be psychologically valid.

Psycholinguistic models. Most of the psycholinguistic models lack the kind
of detail which would make it possible to test them empirically. It would be
difficult, for example, to develop a computer simulation in order to see how the
models would work given real speech input.

Some of the models do attempt to provide answers to the problems mentioned
in the previous section. Massaro and his colleagues (Massaro & Oden,
1980a, 1980b; Oden & Massaro, 1978; Massaro & Cohen, 1977) have recognized
the significance of interactions between features in speech perception.
They propose that, while acoustic cues are perceived independently from one
another, these cues are integrated and matched against a propositional prototype
for each speech sound. The matching procedure involves the use of fuzzy logic
(Zadeh, 1972). In this way their model expresses the generalization that
features frequently exhibit "trading relations" with one another. The model is one
of the few to be formulated in quantitative terms, and provides a good fit to the
data Massaro and his co-workers have collected. However, while we value the
descriptive contribution of this approach, it fails to provide an adequate
statement of the mechanisms required for perception to occur.

Cole and Jakimik (1978, 1980) have also addressed many of the same concerns
which have been identified here. Among other problems, they note the
difficulty of segmentation, the fact that perception is sensitive to the position
within a word, and that context plays an important role in speech perception.
Unfortunately, their observations, while insightful and well-substantiated, have
not yet led to what might be considered a real model of how the speech perceiver
solves these problems.

The approach with which we find ourselves in greatest sympathy is that
taken by Marslen-Wilson (Marslen-Wilson, 1975, 1980; Marslen-Wilson & Tyler,
1975; Marslen-Wilson & Welsh, 1978). Marslen-Wilson has described a model
which is similar in spirit to Morton's (1979) logogen model and which emphasizes
the parallel and interactive nature of speech perception.

In Marslen-Wilson's model, words are represented by active entities which
look much like logogens. Each word element is a type of evidence-gathering
entity; it searches the input for indications that it is present. These elements
differ from logogens in that they are able to respond actively to mismatches in
the signal. Thus, while a large class of word elements might become active at the
beginning of an input, as that input continues many of the words will be
disconfirmed and will remove themselves from the pool of word candidates.
Eventually only a single word will remain. At this point the word is perceived.
Marslen-Wilson's basic approach is attractive because it accounts for many
aspects of speech perception which suggest that processing is carried out in
parallel. While the model is vague or fails to address a number of important
issues, it is attractive enough so that we have used it as the basis for our initial
attempt to build an interactive model of speech perception. We will have more to
say about this model presently.
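
The narrowing of the candidate pool can be sketched in a few lines. This is a
deliberately crude illustration rather than Marslen-Wilson's model or our
simulation: letters stand in for phonemes, mismatching words are dropped
outright instead of being gradually disconfirmed, and the function name and toy
lexicon are assumptions made up for the example.

```python
def narrow_cohort(lexicon, segments):
    """Return the surviving word candidates after each input segment.

    All-or-none elimination is assumed here; a word is dropped as soon as
    the input mismatches it at the current position.
    """
    cohort = set(lexicon)
    history = []
    for position, segment in enumerate(segments):
        cohort = {
            word for word in cohort
            if len(word) > position and word[position] == segment
        }
        history.append(sorted(cohort))
    return history

# Hypothetical toy lexicon; spelling is used as a stand-in for phonemes.
TOY_LEXICON = ["trespass", "tress", "trek", "trend", "tread", "bliss"]
history = narrow_cohort(TOY_LEXICON, "tresp")
for heard, survivors in zip("tresp", history):
    print(heard, survivors)
```

In this toy run five candidates survive the first three segments, two survive
the fourth, and only trespass remains after the fifth, so the word could be
identified before its end has been heard.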

A number of other speech perception models have been proposed, including
those of Pisoni & Sawusch (1976), Cooper (1979), Liberman, Cooper, Harris, &
MacNeilage (1962), and Halle & Stevens (1964), and many of these proposals
provide partial solutions to the problem. For instance, while there are serious
difficulties with a strong formulation of the Motor Theory of Speech Perception
(Liberman et al., 1962), this theory has focused attention on an important fact.
Many of the phenomena which are observed in an acoustic analysis of speech
appear to be puzzling or arbitrary until one understands their articulatory
foundation. There is good reason to believe that speech perception involves (if
not necessarily (MacNeilage, Rootes, & Chase, 1967), at least preferably) implicit
knowledge of the mapping between articulation and sound. It may well be, as
some have suggested (Studdert-Kennedy, 1982), that speech perception is best
understood as event perception, that event being speech production.

Despite insights such as these, we feel that previous models of speech
perception have serious deficiencies.

First, these models are almost never formulated with sufficient detail that
one can make testable predictions from them. Second, many of them simply fail to
address certain critical problems. For example, few models provide any account
for how the units of speech (be they phonemes, morphemes, or words) are
identified given input in which unit boundaries are almost never present. Nor do
most models explain how listeners are able to unravel the encoding caused by
coarticulation.

While we find the greatest agreement with Marslen-Wilson's approach, there
are a number of significant questions his model leaves unanswered. (1) How do
the word elements know when they match the input? The failure of many
machine-based speech recognition systems indicates this is far from a trivial
problem. (2) Do word elements have internal structure? Do they encode phonemes
and morphemes? (3) How is serial order (of words, phonemes, morphemes, etc.)
represented? (4) How do we recognize nonwords? Must we posit a separate
mechanism, or is there some way in which the same mechanism can be used to
perceive both words and nonwords? (5) How is multi-word input perceived? What
happens when the input may be parsed in several ways, either as one long word
or several smaller words (e.g., sell ya light vs. cellulite)? These are all important
questions which are not addressed.

Machine-based models. It might seem unfair to evaluate machine-based
speech recognition systems as models of speech perception, since most of them
do not purport to be such. But as Norman (1980) has remarked in this context,
"nothing succeeds like success." The perceived success of several of the
speech understanding systems to grow out of the ARPA Speech Understanding
Research project (see Klatt, 1977, for review) has had a profound influence on
the field of human speech perception. As a result, several recent models have
been proposed (e.g., Klatt, 1980; Newell, 1980) which do claim to model human
speech perception, and whose use of pre-compiled knowledge and table look-up
is explicitly justified by the success of the machine-based models. For these
reasons, we think the machine-based systems must be considered seriously as
models of human speech perception.

The two best known attempts at machine recognition of speech are HEARSAY
and HARPY.

HEARSAY (Erman & Lesser, 1980; Carnegie-Mellon, 1977) was the more
explicitly psychologically-oriented of the two systems. HEARSAY proposed
several computationally distinct knowledge sources, each of which could operate
on the same structured data base representing hypotheses about the contents of
a temporal window of speech. Each knowledge source was supposed to work in
parallel with the others, taking information from a central "blackboard" as it
became available, suggesting new hypotheses, and revising the strengths of
others suggested by other processing levels.

Although conceptually attractive, HEARSAY was not a computationally
successful model (in the sense of satisfying the ARPA SUR project goals, Klatt,
1977), and there are probably a number of reasons for this. One central reason
appeared to be the sheer amount of knowledge that had to be brought to bear in
comprehension of utterances, even of utterances taken from a very highly
constrained domain such as the specification of chess moves. Knowledge about
what acoustic properties signaled which phonemes, which phonemes might occur
together and how those co-occurrences condition the acoustic properties,
knowledge of which sequences of speech sounds made legal words in the
restricted language of the system, knowledge about syntactic and semantic
constraints, and knowledge about what it made sense to say in a particular context
had to be accessible. The machinery available to HEARSAY (and by machinery we
mean the entire computational approach, not simply the hardware available) was
simply not sufficient to bring all of these considerations to bear in the
comprehension process in anything close to real time.

Three other problems may have been the fact that the analysis of the
acoustic input rarely resulted in unambiguous identification of phonemes; the
difficulties in choosing which hypotheses would most profitably be pursued
first (the "focus of attention" problem); and the fact that the program was
committed to the notion that the speech input had to be segmented into separate
phonemes for identification. This was a very errorful process. We will argue that
this step may be unnecessary in a sufficiently parallel mechanism.

The difficulties faced by the HEARSAY project with the massive parallel
computation that was required for successful speech processing were avoided by
the HARPY system (Lowerre & Reddy, 1980; Carnegie-Mellon, 1977). HARPY's
main advantage over HEARSAY was that the various constraints used by HEARSAY
in the process of interpreting an utterance were pre-compiled into HARPY's
computational structure, which was an integrated network. This meant that the
extreme slowness of HEARSAY's processing could be overcome; but at the
expense, it turned out, of an extremely long compilation time (over 12 hours of
time on a DEC-10 computer). This trick of compiling in the knowledge, together
with HARPY's incorporation of a more sophisticated acoustic analysis, and an
efficient graph-searching technique for pruning the network ("beam search"), made
it possible for this system to achieve the engineering goals established for it.
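
To give a feel for what pruning a pre-compiled network involves, here is a
minimal beam-search sketch. It is a generic fixed-width variant written for
illustration and should not be read as HARPY's actual procedure, whose pruning
rule and scoring are not described here. The two-word network, the state names,
and the 0.8/0.1 match probabilities are all assumptions.

```python
import math

def emission_logp(state, observed):
    """Log-probability that `state` produced the observed label (assumed values)."""
    label = state.split(":")[1]
    return math.log(0.8 if label == observed else 0.1)

def beam_search(transitions, observations, start, beam_width):
    """Extend every surviving path by one state per observation,
    then keep only the `beam_width` best-scoring paths."""
    beam = [(0.0, [start])]  # (cumulative log score, path of states)
    for observed in observations:
        extended = []
        for score, path in beam:
            for nxt in transitions.get(path[-1], []):
                extended.append((score + emission_logp(nxt, observed), path + [nxt]))
        extended.sort(key=lambda item: item[0], reverse=True)
        beam = extended[:beam_width]  # pruning step
    return beam

# Hypothetical pre-compiled network containing the words "pot" and "top".
NETWORK = {
    "START": ["pot:p", "top:t"],
    "pot:p": ["pot:o"], "pot:o": ["pot:t"],
    "top:t": ["top:o"], "top:o": ["top:p"],
}
best_score, best_path = beam_search(NETWORK, ["p", "o", "t"], "START", beam_width=1)[0]
```

Note that every constraint lives in the `NETWORK` table, which is fixed before
any input arrives. Adding a word means rebuilding the table, which is the kind
of rigidity discussed next.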

However, HARPY leaves us at a dead end. Its knowledge is frozen into its
structure and there is no natural way for knowledge to be added or modified. It is
extremely unlikely that the simplified transition network formalism underlying
HARPY can actually provide an adequate formal representation of the structure of
language or the flexibility of its potential use in real contexts.



Both the psycholinguistic and the machine models share certain fundamental
assumptions about how the processing of speech is best carried out. These
assumptions derive, we feel, from the belief that the von Neumann digital
computer is the appropriate metaphor for information processing in the brain. This
metaphor suggests that processing is carried out as a series of operations, one
operation at a time; that these operations occur at high speeds; and that
knowledge is stored in random locations (as in Random Access Memory) and must
be retrieved through some search procedure. These properties give rise to a
characteristic processing strategy consisting of iterated hypothesize-and-test
loops. (It is curious that even in the case of HEARSAY, which came closest to
escaping the von Neumann architecture, the designers were unwilling to abandon
this fundamental strategy.)

Yet we note again how poorly this metaphor has served in developing a model
for human speech perception. Let us now consider an alternative.


THE INTERACTIVE ACTIVATION MODEL OF SPEECH PERCEPTION


The Philosophy Underlying the Present Model

In contrast to HARPY and HEARSAY, we do not believe that it is reasonable
to work toward a computational system which can actually process speech in real
time or anything close to it. The necessary parallel computational hardware
simply does not exist for this task. Rather, we believe that it will be more
profitable to work on the development of parallel computational mechanisms which
seem in principle to be capable of the actual task of speech perception, given
sufficient elaboration in the right kind of hardware, and to explore them by running
necessarily slow simulations of massively parallel systems on the available
computational tools. Once we understand these computational mechanisms, they
can be embodied in dedicated hardware specially designed and implemented
through very large scale integration (VLSI).

Again in contrast to HARPY and HEARSAY, we wish to develop a model which
is consistent with what we know about the psychology and physiology of speech
perception. Of course this is sensible from the point of view of theoretical
psychology. We believe it is also sensible from the point of view of designing an
adequate computational mechanism. The only existing computational mechanism
that can perceive speech is the human nervous system. Whatever we know about
the human nervous system, both at the physiological and psychological levels,
provides us with useful clues to the structure and the types of operations of one
computational mechanism which is successful at speech perception.

We have already reviewed the psychological constraints, in considering
reasons why the problem of speech perception is difficult and in exploring possible
clues about how it occurs. In addition, there are a few things to be said about
the physiological constraints.

What is known about the physiology is very little indeed, but we do know the
following. The lowest level of analysis of the auditory signal is apparently a
coding of the frequency spectrum present in the input. There is also evidence of
some single-unit detectors in lower-order mammals for transitions in frequency
either upward or downward, and some single units respond to frequency
transitions away from a particular target frequency (Whitfield & Evans, 1965).
Whether such single units actually correspond to functional detectors for these
properties is of course highly debatable, but the sparse evidence is at least
consistent with the notion that there are detectors for properties of the acoustic
signal, beginning at the lowest level with detectors for the particular frequencies
present in the signal. Detectors may well be distributed over large populations of
actual neurons, of course.

More fundamentally, we know that the brain is a highly interconnected
system. The number of neurons in the cortex (conservatively, 10 billion) is not nearly
as impressive as the number of synapses, perhaps as many as 10^[exponent lost].
The connectivity of cortical cells is such that a change of state in one area is
likely to influence neurons over a very wide region.

We know also that neuronal conductivity is relatively slow, compared with
digital computers. Instruction cycle times of digital computers are measured on
the order of nanoseconds; neuronal transmission times are measured on the order
of milliseconds. Where does the power of the human brain come from, then? We
suggest it derives from at least these two factors: the interconnectedness of
the system, and the ability to access memories by content. Content addressable
memory means that information can be accessed directly instead of accessed
through a sequential scan of randomly ordered items.

This leads us toward a model which is explicitly designed to deal with all of
the constraints outlined above. We have adopted the following "design
principles":


• The model should be capable of producing behavior which is as similar
as possible to human speech perception. We consider experimental
data to be very important in providing constraints and clues
as to the model's design. The model should not only perform as well
as humans, but as poorly in those areas where humans fail.

• The model should be constructed using structures and processes
which are plausible given what we know about the human nervous
system. We do not claim that the model is an image of those neuronal
systems which are actually used in humans to perceive
speech, since we know next to nothing about these mechanisms.
But we have found that mechanisms which are inspired by the
structure of the nervous system offer considerable promise for
providing the kind of parallel information processing which seems to be
necessary.

• The model should not be constrained by the requirement that computer
simulations run in real time. Parallel processes can be simulated
on a serial digital machine, but not at anything approaching
real-time rates. The goal of real time operation at this point would
be counter-productive and would lead to undesirable compromises.


The  COHORT  Model 

Our initial attempt to construct a model which met these requirements was
called the COHORT model, and it was an attempt to implement the model of that
name proposed by Marslen-Wilson and Welsh (1978). Of course, in implementing
the model many details had to be worked out which were not specified in the
original, so the originators of the basic concept cannot be held responsible for all
of the model's shortcomings. COHORT was designed to perceive word input, with
the input specified in terms of time- and strength-varying distinctive features. It
is based on a lexicon of the 3846 most common words (occurring 10 or more
times per million) from the Kucera & Francis corpus (Kucera & Francis, 1967).

Each of the features, phonemes, and words is represented by a node. Nodes
have roughly the same computational power as is traditionally ascribed to a
neuron. Each node has:

-- an associated level of activation which varies over time. These levels may
range from some minimum value usually near -.2 or -.3 to a maximum,
usually set at 1.0;

-- a threshold (equal to 0); when a node's activation level exceeds this
threshold it enters what is called the active state and begins to signal
its activation value to other units;

-- its own (sub-threshold) resting level of activation to which it returns in
the absence of any external inputs.

Each node may be linked to other nodes in a non-random manner. These
connections may be either excitatory or inhibitory. When a node becomes active, it
excites those nodes to which it has excitatory connections, and inhibits nodes to
which it has inhibitory connections by an amount proportional to how strongly its
activation exceeds threshold. These connections have associated weightings,
such that some inputs may have relatively greater impact on a node than others.

A node's current activation level reflects several factors: (1) the node's
initial resting level; (2) the spatial and temporal summation of previous inputs
(excitatory and inhibitory); and (3) the node's rate of decay.
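
The properties just listed can be collected into a small sketch of a single
node. The update rule below is an assumption, written in the general style of
interactive activation models, and is not the equation used in COHORT. The
resting level, decay rate, and input strength are likewise illustrative values
chosen for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    resting: float = -0.1    # assumed sub-threshold resting level
    minimum: float = -0.2    # lower bound on activation
    maximum: float = 1.0     # upper bound on activation
    threshold: float = 0.0   # node signals other nodes only above this
    decay: float = 0.1       # assumed rate of return toward rest per step
    activation: float = field(init=False)

    def __post_init__(self):
        self.activation = self.resting

    def output(self) -> float:
        """Signal sent to connected nodes: the amount by which activation
        exceeds threshold (zero while the node is inactive)."""
        return max(0.0, self.activation - self.threshold)

    def step(self, net_input: float) -> None:
        """net_input is the weighted sum of excitatory (+) and inhibitory (-)
        signals arriving on this time step."""
        if net_input > 0:
            effect = net_input * (self.maximum - self.activation)
        else:
            effect = net_input * (self.activation - self.minimum)
        self.activation += effect - self.decay * (self.activation - self.resting)
        self.activation = min(self.maximum, max(self.minimum, self.activation))

node = Node()
for _ in range(5):           # sustained excitatory input
    node.step(0.3)
active_output = node.output()
for _ in range(100):         # input removed: decay back toward rest
    node.step(0.0)
rest_gap = abs(node.activation - node.resting)
```

Scaling the input by the distance to the ceiling or floor keeps activation
bounded without a hard cutoff doing most of the work; that choice, like the
parameter values, is one option among several.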

A fragment of the system just described is illustrated in Figure 1. At the
lowest level we see the nodes for the acoustic/phonetic features. COHORT
makes use of a set of 22 nodes for 11 bipolar features which are modifications of
the Jakobsonian distinctive features (Jakobson, Fant, & Halle, 1952). These
nodes are activated directly by the input to the model (described below). The
features were chosen for the initial working model for several reasons. They
have proven useful in the description of certain linguistic phenomena (such as
sound change) which suggests they have some psychological reality; the
Jakobsonian features are defined in (sometimes vague) acoustic terms; and recent
work by Blumstein and Stevens (1980; Stevens & Blumstein, 1981) appears to
confirm that some of the features might serve as models for more precise
acoustic templates.

At the next higher level are the nodes for phonemes. COHORT has nodes for
37 different phonemes, including an abstract unit which marks the end of words.
All phonemes except the end of word marker receive excitatory inputs from those
features which signal their presence. Thus, the node for /p/ is activated by input
from the nodes grave, compact, consonantal, oral, voiceless, etc.

Figure 1. Fragment of the COHORT system. Nodes exist for features, phonemes,
and words. The word nodes have a complex schema associated with them, shown
here only for the word bliss. Connections between nodes are indicated by arcs;
excitatory connections terminate in arrows and inhibitory connections terminate in
filled circles.

Before describing the word nodes, a comment is in order regarding the
features and phonemes which were used in COHORT. These choices represent
initial simplifications of very complicated theoretical issues, which we have
chosen not to broach at the outset. Our goal has been to treat the model as a
starting place for examining a number of computational issues which face the
development of adequate models of speech perception, and it is our belief that
many of these issues are independent of the exact nature of the assumptions we
make about the features. The Jakobsonian feature set was a convenient starting
point from this point of view, but it should be clear that the features in later
versions of the model will need substantial revision. The same caveat is true
regarding the phonemes. It is even conceivable that some other type of unit will
ultimately prove better. Again, to some degree, the precise nature of the unit
(phoneme, demisyllable, context-sensitive allophone, transeme, etc.) is
dissociable from the structure in which it is embedded.

It might be argued that other choices of units would simplify the problem of
speech perception considerably and make it unnecessary to invoke the complex
computational mechanisms we will be discussing below. Indeed, some of the units
which have been proposed as alternatives to phonemes have been suggested as
answers to the problem of context-sensitive variation. That is, they encode,
frozen into their definition, variations which are due to context. For example,
context-sensitive allophones (Wickelgren, 1969) attempt to capture differences
in the realizations of particular phonemes in different contexts by imagining that
there is a different unit for each different context. We think this merely
postpones a problem which is pervasive throughout speech perception. In point of
fact, none of these alternatives is able to truly solve the variability which
extends over broad contexts, or which is due to speaker differences, or to
changes in rate of articulation. For this reason we decided to begin with units
(phonemes) which are frankly context-insensitive, and to see if their variability in
the speech stream could be dealt with through the processing structures.

Let us turn now to the word nodes. Words present a special problem for
COHORT. This is because words contain internal structure. In the current version
of the system, this structure is limited to phonemes, but it is quite likely that word
structure also contains information about morphemes and possibly syllable
boundaries. To account for the fact that words are made up of ordered sequences
of phonemes, it seems reasonable to assume that the perceiver's knowledge of
words specifies this sequence.

Word nodes are thus complex structures. A node network which depicts a
word structure is shown for the word bliss in Figure 1. The schema consists of
several nodes, one for each of the phonemes in the word, and one for the word
itself. The former are called token nodes, since there is one for each
occurrence of each phoneme in the word. The latter is simply called the word
node. At the end of each word there is a special token node corresponding to a
word boundary.

Token nodes have several types of connections. Token-word connections
permit tokens to excite their word node as they become active (pass threshold).
Word-token links allow the word node to excite its constituent tokens. This
serves both to reinforce tokens which may have already received bottom-up
input and to prime tokens that have not yet been "heard." Phoneme-token
connections provide bottom-up activation for tokens from phonemes. Finally,
token-token connections let active tokens prime successive tokens and keep
previous tokens active after their bottom-up input has disappeared. Because
listeners have some expectation that new input will match word beginnings, the
first token node of each word has a slightly higher resting level than the
other tokens. (In some simulations, we have also set the second token node to
an intermediate level, lower than the first and higher than the remaining
tokens.) Once the first token passes threshold, it excites the next token in
the word. This priming, combined with the order in which the input actually
occurs, is what permits the system to respond differently to the word pot than
to top.
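
As an illustration only, the word schema described above can be sketched as a
small data structure. The numeric resting levels, the "#" boundary symbol, and
all class and field names are assumptions made for this sketch; the text
specifies only the ordering of the resting levels, not their values.

```python
from dataclasses import dataclass, field

# Hypothetical resting levels; the text gives no numeric values.
REST_FIRST = -0.1    # first token: closest to threshold
REST_SECOND = -0.2   # optional intermediate level for the second token
REST_OTHER = -0.3    # all remaining tokens
BOUNDARY = "#"       # stand-in symbol for the word-boundary token


@dataclass
class Token:
    phoneme: str
    position: int
    activation: float


@dataclass
class WordSchema:
    word: str
    phonemes: list
    graded_second: bool = False
    activation: float = REST_OTHER   # the word node itself
    tokens: list = field(init=False, default_factory=list)

    def __post_init__(self):
        # One token per phoneme occurrence, plus the boundary token.
        for i, ph in enumerate(list(self.phonemes) + [BOUNDARY]):
            if i == 0:
                rest = REST_FIRST
            elif i == 1 and self.graded_second:
                rest = REST_SECOND
            else:
                rest = REST_OTHER
            self.tokens.append(Token(ph, i, rest))

    def successor(self, token):
        """Token primed by `token` over a token-token link, or None."""
        nxt = token.position + 1
        return self.tokens[nxt] if nxt < len(self.tokens) else None


bliss = WordSchema("bliss", ["b", "l", "i", "s"])
pot = WordSchema("pot", ["p", "o", "t"])
top = WordSchema("top", ["t", "o", "p"])
```

In this sketch pot and top contain the same phonemes but differ in which token
carries the raised resting level and in the direction of the successor links,
which is the property the text relies on to tell the two words apart.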

In addition to internal connections with their token nodes, word nodes have
inhibitory connections with all other word nodes. This inhibition reflects
competition between word candidates. Words which match the input will compete
with other words which do not, and will drive their activation levels down.


Word  recognition  in  COHORT 

To further illustrate how COHORT works, we will describe what is involved in
recognizing the word slender.

COHORT does not currently have the capability for extracting features from
real speech, so we must provide it with a hand-constructed approximation of
those features which would be present in the word slender. Also, since the
model is simulated on a digital computer, time is represented as a series of
discrete samples. During each sampling period COHORT receives a list of those
features which might be present during that portion of the word. These
features have time-varying strengths. To simulate one aspect of
coarticulation, the features overlap and rise and fall in strength.
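
A minimal sketch of such hand-constructed input follows. It assumes a
triangular rise and fall for each unit and treats each phoneme label as a
stand-in for its bundle of features; the envelope shape, the spacing, and the
transcription of slender are all assumptions, not details taken from the
simulation.

```python
def envelope(center, half_width, t):
    """Strength that rises linearly to 1.0 at `center`, then falls to 0.0."""
    return max(0.0, 1.0 - abs(t - center) / half_width)


def make_input(units, spacing=6, half_width=9):
    """Return one {unit: strength} dict per sampling period.

    `units` stands in for the feature bundles of successive phonemes.
    Because half_width > spacing, neighbouring envelopes overlap.
    """
    n_samples = spacing * (len(units) - 1) + 1
    frames = []
    for t in range(n_samples):
        frame = {}
        for i, unit in enumerate(units):
            strength = envelope(i * spacing, half_width, t)
            if strength > 0.0:
                frame[unit] = max(frame.get(unit, 0.0), strength)
        frames.append(frame)
    return frames


# Hypothetical transcription of "slender"; "er" is one unit for simplicity.
frames = make_input(["s", "l", "e", "n", "d", "er"])
```

In the first frame the /s/ unit is at full strength while the /l/ unit is
already weakly present, which is the kind of overlap the text uses to mimic
coarticulation.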

At the beginning of the simulation, all nodes in the system are at their
resting levels. During the first few sampling periods the feature nodes
receive activation from the input, but their activation levels remain below
threshold. Eventually, however, some feature nodes become active and begin to
excite all the phonemes which contain them. In the present example, activation
of the features for /s/ results in excitation of the phonemes /z/, /f/, and
/v/ as well as /s/. This is because these other phonemes closely resemble /s/
and contain many of the same features. The /s/, however, is most strongly
activated.
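
The spread of excitation from features to similar phonemes can be sketched as
follows. The three-feature descriptions are a simplified textbook
classification chosen for this example, and the overlap-fraction rule is an
assumption; the feature set and weighting actually used in COHORT are not
given in the text.

```python
# Simplified feature sets, three features per phoneme (an assumption).
FEATURES = {
    "s": {"fricative", "alveolar", "voiceless"},
    "z": {"fricative", "alveolar", "voiced"},
    "f": {"fricative", "labiodental", "voiceless"},
    "v": {"fricative", "labiodental", "voiced"},
    "m": {"nasal", "bilabial", "voiced"},
}


def phoneme_excitation(active_features):
    """Fraction of each phoneme's features that are currently active."""
    return {
        ph: len(feats & active_features) / len(feats)
        for ph, feats in FEATURES.items()
    }


# Present the features of /s/ and see which phonemes they excite.
excitation = phoneme_excitation(FEATURES["s"])
ranked = sorted(excitation, key=excitation.get, reverse=True)
```

With these feature sets /s/ receives the most excitation, /z/ and /f/ share
two of its three features, /v/ shares one, and /m/ shares none.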

The next thing that happens is that active phonemes excite their
corresponding token nodes in all the words that contain those phonemes.
Initial token nodes (such as the /s/ in slender) are more likely to pass
threshold than word-internal nodes (such as the /s/ in twist) since these
nodes have higher resting levels. When the token nodes become active, they
begin to activate word nodes and also their successor token nodes. Of course,
while all this happens, input continues to provide bottom-up excitation.

As time goes on, the internal connections begin to play an increasing role in
determining the state of the system. Once word nodes become active they
provide a strong source of top-down excitation for their token nodes and also
compete with one another via inhibitory connections. Early in the input there
may be many words which match the input and are activated. These will compete
with one another but none will be able to dominate; however, they will drive
down the activations of other words. Those words which fail to continue to
receive bottom-up excitation will fall away, both through their own decay and
through inhibition from more successful candidates. Eventually only a single
word will remain active and will push down the activation levels of
unsuccessful word nodes.
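
The competition just described can be sketched with a deliberately reduced
update rule: each word node gains activation when the current input matches
its phoneme at that position, loses some through decay, and is inhibited by
the summed activation of its rivals. The rule, the parameter values, the toy
lexicon, and the use of letters for phonemes are all assumptions; tokens,
priming, and thresholds are left out.

```python
RATE, DECAY, INHIBITION = 0.2, 0.1, 0.05  # hypothetical parameters


def clip(x):
    """Keep activations in the range [0, 1]."""
    return min(1.0, max(0.0, x))


def compete(input_phonemes, lexicon):
    """Return the word activations after each input phoneme."""
    act = {w: 0.0 for w in lexicon}
    history = []
    for t, ph in enumerate(input_phonemes):
        new = {}
        for w in lexicon:
            # Bottom-up support: does this word have `ph` at position t?
            support = 1.0 if t < len(w) and w[t] == ph else 0.0
            rivals = sum(a for v, a in act.items() if v != w)
            new[w] = clip(act[w] + RATE * support
                          - DECAY * act[w] - INHIBITION * rivals)
        act = new
        history.append(dict(act))
    return history


# Letters stand in for phonemes.
history = compete("slender", ["slender", "slim", "send"])
final = history[-1]
```

After the second phoneme slender and slim are tied and neither dominates; by
the end of the input only slender remains strongly active, while slim and send
have decayed and been pushed down by inhibition.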


One can monitor these events by examining the activation levels of the
various types of nodes in the system. In Figure 2, for example, we see a graph
of the activation levels of word nodes, given input appropriate to the word
slender. At time t0 the word nodes' activation levels rest just below
threshold. During the first 16 or so time cycles the activation levels remain
constant, since it takes a while for the feature, phoneme, and token nodes to
become active and excite the word nodes. After this happens a large number of
words become active. These are all the words which begin with the phoneme /s/.
Shortly after the 25th cycle features for the phoneme /l/ are detected and
words such as send fall away, but other words such as slim remain active. When
the /e/ is detected slim and similar words are inhibited. At the end only
slender remains active.

This simulation reveals two interesting properties of COHORT. First, we note
that occasionally new words such as lend and endless join the cohort of active
words. Even though they do not begin with /s/ they resemble the input enough to
reach threshold. We regard this as desirable because it is clear that human
listeners are able to recover from initial errors. One problem we have found in
other simulations is that COHORT does not display this behavior consistently
enough.

Secondly, we see that the word node for slender begins to dominate
surprisingly early in time. In fact, it begins to dominate at just the point
where it provides a unique match to the input. This agrees with
Marslen-Wilson's (1980) claim that words are recognized at the point where
they become uniquely identifiable.
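
The notion of a point of unique identifiability can be made concrete with a
short sketch. It treats letters as phonemes and uses a small made-up lexicon;
both are assumptions for illustration, and the function computes the point
from the lexicon alone rather than from any activation dynamics.

```python
def uniqueness_point(word, lexicon):
    """Number of initial phonemes after which `word` is the only match.

    Returns None if the word is a prefix of another entry and so never
    becomes unique before it ends.
    """
    others = [w for w in lexicon if w != word]
    for n in range(1, len(word) + 1):
        if not any(w.startswith(word[:n]) for w in others):
            return n
    return None


# Letters stand in for phonemes; this toy lexicon is hypothetical.
LEXICON = ["slender", "slim", "sled", "send", "lend", "end", "endless"]
```

In this lexicon slender becomes unique at its fourth phoneme, once sled has
been ruled out, whereas end never becomes unique because endless shares all of
its phonemes.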

We can also monitor the activation levels of the tokens within the word
schema for slender, as shown in Figure 3. At time t0 all tokens are below
threshold, although /s/ is near threshold and the /l/ is also slightly higher
than the remaining tokens. (Recall that the initial tokens have higher resting
levels, reflecting the perceiver's expectations for hearing first sounds
first.) The /s/ token passes threshold fairly quickly. When it becomes active
it excites both the slender word node and also the next token in the word,
/l/. After more cycles, the /l/ token begins to receive bottom-up input from
the feature nodes, and the /s/'s feature input decreases.

WORD ACTIVATIONS

Figure 2. Activation levels of selected word nodes, given feature inputs
appropriate for the word slender. At the start all words which begin with s
are activated. As time goes on only those words which more closely resemble
the input remain active; other words decay and are also inhibited by the
active nodes. Finally only the node for slender dominates.

SLENDER TOKEN ACTIVATIONS

Figure 3. Activations of the token nodes associated with slender, given input
appropriate for this word.

The same basic pattern continues throughout the rest of the word, with some
differences. The levels of nodes rise slowly even before they receive
bottom-up input and become active. This occurs because the nodes are receiving
lateral priming from earlier tokens in the word, and because once the word
node becomes active it primes all its constituent token nodes. This lateral
and top-down excitation is also responsible for the tendency of token nodes to
increase again after decaying once bottom-up input has ceased (for example,
/s/'s level starts to decay at cycle 26, then begins to increase at cycle 30).
By the end of the word, all the tokens are very active, despite the absence of
any bottom-up excitation.

This example demonstrates how COHORT deals with two of the problems we
noted in the first section. One of these problems, it will be recalled, is the
spreading of features which occurs as a result of coarticulation. At any
single moment in time, the signal may contain features not only of the
"current" phoneme but also of neighboring phonemes. In the current version of
COHORT we provide the simulation with hand-constructed input in which this
feature spreading is artificially mimicked. Because COHORT is able to activate
many features and phonemes at the same time, this coarticulation helps the
model anticipate phonemes which may not, properly speaking, be fully present.
In this way coarticulation is treated as an aid to perception, rather than as
a source of noise. While the sort of artificial input we provide obviously
does not present the same level of difficulty which is present in real speech,
we believe that COHORT's approach to dealing with these momentary aspects of
coarticulation is on the right track.

A second problem faced by many speech recognition systems is that of
segmentation: How do you locate units in a signal which contains few obvious
unit boundaries? For COHORT this problem simply never arises. As the evidence
for different phonemes waxes and wanes, the activation levels of phonemes and
tokens rise and fall in continuous fashion. Tokens which are activated in the
right sequence (i.e., belong to real words) activate word nodes, which are
then able to provide an additional source of excitation for the tokens. At the
end of the process, all the phoneme tokens of the word that has been heard are
active, but there is no stage during which explicit segmentation occurs.

In addition to these two characteristics, COHORT can be made to simulate
two phenomena which have been observed experimentally in human speech
perception. The first of these phenomena is phonemic restoration.

The human speech processing system is capable of perceiving speech in the
face of considerable noise. This ability was studied in an experiment by
Warren (1970). Warren asked subjects to listen to tapes containing naturally
produced words in which portions of the words had been replaced by noise.
Warren found that, although subjects were aware of the presence of noise, they
were unaware that any part of the original word had been deleted (in fact,
they were usually unable to say where in the word the noise occurred). Samuel
(in press) has replicated and extended these findings using a signal detection
paradigm. (In Samuel's experiments, some stimuli have phonemes replaced by
noise and other stimuli have noise added in. The subjects' task is to
determine whether the phoneme is present or absent.) One of Samuel's important
findings is that this phenomenon, phonemic restoration, actually completes the
percept so strongly that it makes subjects insensitive to the distinction
between the replacement of a phoneme by noise and the mere addition of noise
to an intact speech signal. Listeners actually perceive the missing phonemes
as if they were present.

We were interested in seeing how COHORT would respond to stimuli in which
phonemes were missing. To do this, we prepared input protocols in which we
turned off feature input during those cycles which corresponded in time to a
particular phoneme. In one of these simulations, we deleted all feature input
for the /d/ of slender. (Note that this differs slightly from the standard
phonemic restoration experiment, in which noise is added to the signal after a
phoneme is deleted.)

In Figure 4 we observe the activations of the slender token nodes which
result from this input. These levels may be compared with those in Figure 3.
There are no obvious differences between the two conditions. The /d/ token
succeeds in becoming active despite the absence of bottom-up input. This
suggests that the token-token priming and the top-down excitation from word to
token are a powerful force during perception.

Figure 5 compares the word node activation for slender with and without /d/
input. The two patterns are remarkably alike. COHORT appears to respond much
as human perceivers do given similar input: the distinction between the
presence and the absence of the /d/ is lost in context.
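
The role of lateral priming and top-down excitation in this result can be
sketched with a reduced token chain for a single word. The update rule, the
parameter values, the one-phoneme-per-step timing, and the transcription are
assumptions made for this example; there is no threshold, competition, or
feature level.

```python
BU, PRIME, TOP, UP, DECAY = 0.5, 0.2, 0.3, 0.5, 0.2  # hypothetical values


def clip(x):
    """Keep activations in the range [0, 1]."""
    return min(1.0, max(0.0, x))


def run(tokens, deleted=None, prime=PRIME, top=TOP):
    """Present one phoneme per step; return the final token activations.

    `deleted` names a phoneme whose bottom-up input is switched off.
    """
    n = len(tokens)
    act = [0.0] * n
    word = 0.0
    for t in range(n):
        new = []
        for i in range(n):
            heard = 1.0 if (i == t and tokens[i] != deleted) else 0.0
            lateral = act[i - 1] if i > 0 else 0.0
            new.append(clip(act[i] + BU * heard + prime * lateral
                            + top * word - DECAY * act[i]))
        # Tokens excite the word node, which feeds back on the next step.
        word = clip(word + UP * sum(act) / n - DECAY * word)
        act = new
    return dict(zip(tokens, act))


SLENDER = ["s", "l", "e", "n", "d", "er"]   # hypothetical transcription
intact = run(SLENDER)
restored = run(SLENDER, deleted="d")
no_links = run(SLENDER, deleted="d", prime=0.0, top=0.0)
```

With the links switched off, the deleted /d/ token never leaves zero; with
them on, it rises without any bottom-up input. With these made-up parameters
the restored token remains weaker than the intact one, so the sketch shows
only the direction of the effect, not the near-identical curves of Figures 3
and 4.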

A second phenomenon we attempted to replicate with COHORT was the lexical
bias in phoneme identification first noted by Ganong (1980). As previously
mentioned, Ganong discovered that if listeners are asked to identify the
initial consonant in stimuli which range perceptually from a word to a
nonword, the phoneme boundary is displaced toward the word end of the
continuum, compared with its location on a non-word/word continuum. In short,
lexical status affects perception at the level of phonetic categorization.

In order to simulate this experiment, we presented COHORT with input which
corresponded to a word-initial bilabial stop, followed by features for the
sequence a+r. The feature values for the bilabial stop were adjusted in such a
way as to make it indeterminate for voicing; it sounded midway between bar and
par. Although COHORT knows the word bar, it does not have par in its lexicon,
so par is effectively a nonword for the purposes of the simulation.

SLEN-ER TOKEN ACTIVATIONS

Figure 4. Activations of the token nodes associated with slender, given input
appropriate for this word.

SLENDER, with and without the /d/

Figure 5. Activation levels of the slender word node for input in which the d
is present (solid line), compared to when the d is absent (broken line).

The simulation differed from Ganong's experiment in that he measured the
phoneme boundary shift by presenting a series of stimuli to subjects and then
calculating the boundary as the location of the 50% labelling crossover. In
our experiment we were able to present the model with a stimulus which should
have been exactly at the phoneme boundary, assuming a neutral context (e.g.,
if the stimulus had been a nonsense syllable such as ba or pa rather than a
potential word). The way we determined whether or not a lexical effect similar
to Ganong's had occurred was to examine the activation levels of the /b/ and
/p/ phoneme nodes.


Figure 6 shows the activation levels of these two nodes over the time
course of processing the input stimulus. Both nodes become highly activated
during the first part of the word. This is the time when bottom-up input is
providing equal activation for both voiced and voiceless bilabial stops. Once
the bottom-up input is gone, both levels decay. What is of interest is that
the /b/ node remains at a higher level of activation. We assume that this
higher level would be reflected in a boundary shift on a phoneme
identification test toward the voiced end of the continuum.

When we think about why COHORT displays this behavior, which is similar to
that of Ganong's human subjects, we realize that the factors responsible for
the greater activation of the /b/ node are essentially the same as those which
cause phonemic restoration. Top-down excitation from the word level exerts a
strong influence on perception at the phoneme level.

This realization leads to an interesting prediction. Because the lexical
effect reflects the contribution of top-down information, it should be the
case that when the target phoneme (i.e., the one to be identified) occurs
later in the word, rather than at the beginning as is the case with the
bar/par stimulus, the difference in activations of the two competing nodes
should be magnified. This is because the word node has had longer to build up
its own activation and is therefore able to provide greater support for the
phoneme which is consistent with it.
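
Both the lexical effect and this prediction can be illustrated with a reduced
sketch containing only phoneme nodes and a single word node with feedback. The
update rule, the parameter values, the "?" symbol for the ambiguous stop, the
spelling rab for the phonemes of rob, and the choice to measure the /b/ versus
/p/ difference at a fixed lag after the stop are all assumptions; measuring at
a different point could give a different comparison.

```python
BU, FB, UP, DECAY = 0.5, 0.2, 0.5, 0.2   # hypothetical parameters
AMBIGUOUS = "?"                           # stop midway between /b/ and /p/


def voiced_advantage(sequence, word, lag=2, feedback=FB):
    """Return act(/b/) - act(/p/), `lag` steps after the ambiguous stop.

    No clipping is applied; activations stay well below 1 in these short runs.
    """
    stop_at = sequence.index(AMBIGUOUS)
    phon = {ph: 0.0 for ph in set(word) | {"b", "p"}}
    w = 0.0
    for t in range(stop_at + lag + 1):
        heard = sequence[t] if t < len(sequence) else None
        new = {}
        for ph, a in phon.items():
            excited = heard == ph or (heard == AMBIGUOUS and ph in ("b", "p"))
            bottom_up = BU if excited else 0.0
            # Only phonemes belonging to the word receive feedback from it.
            top_down = feedback * w if ph in word else 0.0
            new[ph] = a + bottom_up + top_down - DECAY * a
        w = w + UP * sum(phon[ph] for ph in word) / len(word) - DECAY * w
        phon = new
    return phon["b"] - phon["p"]


initial = voiced_advantage([AMBIGUOUS, "a", "r"], word="bar")
final = voiced_advantage(["r", "a", AMBIGUOUS], word="rab")
```

In this sketch the /b/ node ends above the /p/ node in both positions, the
advantage is larger when the stop comes last, and it vanishes when feedback is
set to zero.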

Figure 7 demonstrates that COHORT does indeed perform in this manner. We
presented the simulation with input appropriate to the sequence ra followed by
a bilabial stop that was again intermediate with regard to voicing. Rob is a
word in COHORT's lexicon, but rop is not, so we would expect a greater level
of activation for /b/ than for /p/, based on top-down excitation.

This indeed occurs. But what we also find is that the magnitude of the
difference is slightly greater than when the target phoneme occurs at the
beginning of the word. The finding has not yet been tested with human
perceivers, but it is consistent with other findings mentioned above (Cole &
Jakimik, 1978, 1980; Marslen-Wilson & Welsh, 1978) which point to greater
top-down effects at word endings than at word beginnings.

Figure 6. Activation of b and p phoneme nodes, given feature input for the
sequence bilabial stop+a+r, in which the stop is indeterminate for voicing.
Since the lexicon contains the word bar but not par, top-down excitation
favors the perception of the stop as voiced.

Figure 7. Activation of b and p phoneme nodes, given feature input for the
sequence r+a+bilabial stop, in which the stop is indeterminate for voicing.
The lexicon contains the word rob, but not rop, so the b node becomes more
activated than the p node.

In simulating Ganong's lexical effect on the phoneme boundary, we added a
provision to COHORT which was not provided for by Marslen-Wilson and Welsh
(1978): feedback from the word to the phoneme level. They, along with Morton
(1979), have accounted for lexical and other contextual effects on phoneme
identification in terms of a two-step process, in which context affects word
identification, and then the phonological structure of the word is unpacked to
determine what phonemes it contains.

The  alternative  we  prefer  is  to  permit  feedback  from  the  words  to  actually 
influence  activations  at  the  phoneme  level.  In  this  way,  partial  activations  of 
words  can  influence  perception  of  nonwords. 

The addition of feedback from the words to the phoneme level in COHORT
raises a serious problem, however. If the feedback is strong enough so that
the phoneme nodes within a word are kept active as the perceptual process
unfolds, then all words sharing the phonemes which have been presented
continue to receive bottom-up support and the model begins to lose its ability
to distinguish words having the same phonemes in them in different orders.
This and other problems, to be reviewed below, have led us to a different
version of an interactive activation model of speech perception, called TRACE.


The  TRACE  Model 

Given COHORT's successes, one might be tempted to suggest that it is
feedback to the phoneme level, and not the rest of the assumptions of COHORT,
which is in error. However, there are other problems as well with this version
of the model. First, words containing multiple occurrences of the same phoneme
present serious problems for the model. The first occurrence of the phoneme
primes all the tokens of this phoneme in words containing this phoneme
anywhere. Then the second occurrence pushes the activations of all of these
tokens into the active range. The result is that words containing the repeated
phoneme anywhere in the word become active. At the same time, all words
containing multiple occurrences of the twice-active phoneme get so strongly
activated that the model's ability to distinguish between them based on
subsequent (or prior) input is diminished. A second difficulty is that the
model is too sensitive to the durations of successive phonemes. When durations
are too short they do not allow for sufficient priming. When they are too long
too much priming occurs and the words begin to "run away" independently of
bottom-up activation.
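
The repeated-phoneme problem can be illustrated with a stripped-down sketch in
which every token of a phoneme receives input whenever that phoneme is heard,
regardless of its position in the word. The threshold, input strength, and
retention factor are assumptions, letters stand in for phonemes, and lateral
priming and word-level feedback are omitted.

```python
THRESHOLD = 1.0
BOTTOM_UP = 0.6   # hypothetical: one occurrence leaves a token below threshold
RETAIN = 0.9      # hypothetical fraction of activation kept per step


def active_tokens(input_phonemes, word):
    """Tokens of `word` at or above threshold when input is not gated."""
    levels = [0.0] * len(word)
    for heard in input_phonemes:
        levels = [RETAIN * a for a in levels]
        for i, ph in enumerate(word):
            if ph == heard:          # every token of this phoneme is excited
                levels[i] += BOTTOM_UP
    return [(i, ph) for i, ph in enumerate(word) if levels[i] >= THRESHOLD]
```

Presenting the input dad to the word slender leaves the /d/ token of slender
above threshold after the second /d/, even though slender is a poor match to
the input; a single /d/ does not have this effect.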

In essence, both of these problems come down to the fact that COHORT
uses a trick to handle the sequential structure of words: it uses lateral
priming of one token by another to prepare to perceive the second phoneme in a
word after the first and so on. The problems described above arise from the
fact that this is a highly unreliable way of solving the problem of the
sequential structure of speech. To handle this problem there needs to be some
better way of directing the input to the appropriate place in the word.

Sweeping the input across the tokens. One way to handle some of these
problems is to assume that the input is sequentially directed to the
successive tokens of each word. Instead of successive priming of one token by
the next, we could imagine that when a token becomes active, it causes
subsequent input to be gated to its successor rather than itself. All input,
of course, could be directed initially toward the first token of each word. If
this token becomes active, it could cause the input to be redirected toward
the next token. This suggestion has the interesting property that it
automatically avoids double activation of the same token on the second
presentation of the corresponding phoneme. It may still be sensitive to rate
variations, though this could be less of a problem than in the preceding
model. Within-word filling in could still occur via the top-down feedback from
the word node, and of course this would take a while to build up and so would
be more likely to occur for later phonemes than for earlier ones.
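
A minimal sketch of this gating idea follows. Each word keeps a pointer to the
one token that currently receives input, and the pointer advances when that
token crosses threshold. The class and function names, the threshold, and the
input strength are assumptions; letters stand in for phonemes, each input
character is one sample, and top-down feedback is omitted.

```python
THRESHOLD = 1.0
STRENGTH = 0.6    # hypothetical input per sample; two samples cross threshold


class GatedWord:
    def __init__(self, phonemes):
        self.phonemes = list(phonemes)
        self.levels = [0.0] * len(self.phonemes)
        self.gate = 0                 # index of the token receiving input

    def receive(self, heard):
        if self.gate == len(self.phonemes):
            return                    # every token is already active
        if self.phonemes[self.gate] == heard:
            self.levels[self.gate] += STRENGTH
            if self.levels[self.gate] >= THRESHOLD:
                self.gate += 1        # redirect later input to the successor

    def complete(self):
        return self.gate == len(self.phonemes)


def present(word, samples):
    """Feed `samples` one at a time to a fresh gated node for `word`."""
    node = GatedWord(word)
    for heard in samples:
        node.receive(heard)
    return node


pot = present("pot", "ppoott")
top = present("top", "ppoott")
dad = present("dad", "ddaadd")
```

Here pot is completed by the input while top only gets its first token
started, and in dad the second /d/ is routed to the final token rather than
doubling the activation of the first. Presenting pot with one sample per
phoneme fails to complete the word, which reflects the remaining sensitivity
to rate.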

However, this scheme shares a serious problem with the previous one. In the
absence of prior context, both versions depend critically on clear word
beginnings to get the right word schemas started. We suspect that it is
inferior to human perceivers in this respect. That is, we suspect that humans
are able to recognize words correctly from their endings (insofar as these are
unique) even when the beginnings are sufficiently noisy so that they would
produce only very weak word-level activations at first and thus would not get
the ball rolling through the word tokens.

Generalized sweeping. A potential solution to this problem would be to
sweep the input through all tokens, not just those in which the input has
already produced activations. However, it is not clear on what basis to
proceed with the sweep. If it were possible to segment the input into phonemes
then one could step along as each successive phoneme came in; but we have
argued that there is no segmentation into phonemes. Another possibility is to
step along to the next token as tokens become active at the current position
in any words. Though this does not require explicit segmentation of the input,
it has its drawbacks as well. For one thing it means that the model is
somewhat rigidly committed to its position within a word. It would be
difficult to handle cases where a nonsense beginning was followed by a real
word (as in, say, unpticohort), since the model would be directing the ending
toward the ends of longer words rather than toward beginnings.

The memory trace. A problem with all of the schemes considered thus far is
that they have no memory, except within each word token. Patterns of
activation at the phoneme level come and go very quickly; if they do not,
confusion sets in. The fact that the memory is all contained within the
activations of the word tokens makes it hard to account for context effects in
the perception of pseudowords (Samuel, 1979). Even when these stimuli are not
recognized as words, missing phonemes which are predictable on the basis of
regularities in patterns of phoneme co-occurrence are nevertheless filled in.
Such phenomena suggest that there is a way of retaining a sequence of
phonemes, and even filling in missing pieces of it, when that sequence does
not form a word. One possibility is to imagine that the activations at the
phoneme level are read out into some sort of post-identification buffer as
they become active at the phoneme level. While this may account for some of
the pseudoword phenomena, retrospective filling in of missing segments would
be difficult to arrange. What appears to be needed is a dynamic memory in
which incomplete portions of past inputs can be filled in as the information
which specifies them becomes available. The TRACE model attempts to
incorporate such dynamic memory into an interactive activation system. We are
only now in the process of implementing this model via a computer simulation,
so we can only offer the following sketch of how it will work.
We propose that speech perception takes place within a system which
possesses a dynamic representational space which serves much the same function
as the Blackboard in HEARSAY. We might visualize this buffer as a large set of
banks of detectors for phonetic features and phonemes, and imagine that the
input sweeps out a pattern of activation through this buffer. That is, the
input at some initial time t0 would be directed to the first bank of
detectors, the input at the next time slice would be directed to the next
bank, and so on. These banks are dynamic; that is, they contain nodes which
interact with each other, so that processing will continue in them after
bottom-up input has ceased. In addition to the interactions within a time
slice, nodes would interact across slices. Detectors for mutually incompatible
units would be mutually inhibitory, and detectors for the units representing
an item spanning several slices would support each other across slices. We
assume in this model that information written into a bank would tend to decay,
but that the rate of decay would be determined by how strongly the incoming
speech pattern set up mutually supportive patterns of activation within the
trace.

Above the phoneme level, we presume that there would be detectors for
words. These, of course, would span several slices of the buffer. It seems
unreasonable to suppose that there is an existing node network present
containing nodes for each word at each possible starting position in the
buffer. It seems, then, that the model requires the capability of creating
such nodes when it needs them, as the input comes in. Such nodes, once
created, would interact with the phoneme buffers in such a way as to ensure
that only the correct sequence of phonemes will strongly activate them. Thus,
the created node for the word cat starting in some slice will be activated
when there is a /c/ in the starting slice and a few subsequent slices, an /a/
in the next few slices, and a /t/ in the next few, but will not be excited
(except for the /a/) when these phonemes occur in the reverse order.
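
Since the model is described here only in outline, the following is a
speculative sketch of how a word detector anchored at a given starting slice
might be scored against the phoneme banks. The fixed span of two slices per
phoneme, the averaging rule, and the function names are assumptions; letters
stand in for phonemes, and decay and inhibition within the trace are omitted.

```python
def make_trace(phonemes, span=2):
    """One {phoneme: activation} bank per time slice, `span` slices each."""
    return [{ph: 1.0} for ph in phonemes for _ in range(span)]


def word_support(trace, word, start, span=2):
    """Mean activation of the phonemes `word` expects, slice by slice."""
    total = 0.0
    for k, ph in enumerate(word):
        # The k-th phoneme of the word is expected in these slices.
        for s in range(start + k * span, start + (k + 1) * span):
            if s < len(trace):
                total += trace[s].get(ph, 0.0)
    return total / (len(word) * span)


forward = make_trace("cat")
backward = make_trace("tac")
```

A cat detector starting at slice 0 is fully supported by the forward trace
but receives only the contribution of the middle phoneme from the reversed
one. The same function can be evaluated at other starting slices, which
corresponds to creating a detector at each position where it is needed.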

A simplified picture of the TRACE model is shown in Figure 8. Time is
represented along the horizontal axis, with successive columns for individual
memory traces. Within each trace there are nodes for features and phonemes,
but only phoneme nodes are shown here. The activation level of each of these
nodes (and of the word nodes above) is shown as a horizontal bar; thicker bars
indicate greater levels of activation.

Figure 8. Partial view of the TRACE system. Time is represented along the
horizontal axis, with columns for succeeding "traces." Each trace contains
nodes for phonemes and features (only the phoneme nodes are shown). Input is
shown along the bottom in phonemic form; in reality, input to the phoneme
nodes would consist of excitation from the feature nodes within each trace. At
the top are shown the word nodes and the activations they receive in each time
slice. Because the input can be parsed in various ways, several word nodes are
active simultaneously and overlap.

Along the bottom is shown sample input. The input is presented here in
phonemic form for ease of representation; it would actually consist of the
excitations from the (missing) feature nodes, which in turn would be excited
by the speech input.

Because the input as shown could be parsed in different ways, the word
nodes for slant, land, and bus all receive some activation; slant is most heavily
activated since it most closely matches the input, but the sequence bus land is
also entertained. Presumably context and higher-level information are used to
provide the necessary input to disambiguate the situation.
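
A small Python sketch may help make the overlapping parses concrete. The input string, the phonemic spellings, and the scoring rule (count of phonemes that line up with the trace) are all assumptions chosen for illustration; the figure itself does not specify them.

```python
LEXICON = {  # hypothetical phonemic spellings; "^" is the vowel of "bus"
    "slant": ["s", "l", "a", "n", "t"],
    "land": ["l", "a", "n", "d"],
    "bus": ["b", "^", "s"],
}


def best_alignment(word, trace):
    """Return (support, start) for the best-matching start position.

    Support is the number of phonemes of `word` that line up with the
    trace when the word is assumed to begin at slice `start`.
    """
    best = (0, 0)
    for start in range(len(trace)):
        support = sum(
            1
            for i, phoneme in enumerate(word)
            if start + i < len(trace) and trace[start + i] == phoneme
        )
        if support > best[0]:
            best = (support, start)
    return best


trace = ["b", "^", "s", "l", "a", "n", "t"]  # assumed input, one phoneme per slice
activations = {name: best_alignment(ph, trace) for name, ph in LEXICON.items()}
for name, (support, start) in activations.items():
    print(f"{name}: support {support}, starting at slice {start}")
```

On this assumed input, slant receives the most support, while bus and land each receive partial support over different stretches of the trace, so both parses remain available.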

In this model, we can account for filling-in effects in terms of top-down
activations of phonemes at particular locations in the trace. One important
advantage of the TRACE model is that a number of word tokens partially consistent
with a stretch of the trace and each weakly activating a particular phoneme could
conspire together to fill in a particular phoneme. Thus if the model heard fluggy,
words which begin with flu- such as fluster and flunk would activate phoneme
nodes for /f/, /l/, and /^/ in the first part of the trace, and words which end
with -uggy such as buggy and muggy would activate nodes for /^/, /g/, and /i/
in the latter part of the trace. In this way the model could be made to account
easily for filling-in effects in pseudoword as well as word perception.
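
The conspiracy among partially matching words can be sketched in Python as follows. This is a toy illustration under stated assumptions: the four-word lexicon, the phonemic spellings, the `MIN_MATCH` threshold, and the rule that a word feeds back its match count are all hypothetical and are not taken from the simulation.

```python
from collections import defaultdict

LEXICON = {  # hypothetical phonemic spellings; "^" is the vowel of "bug"
    "fluster": ["f", "l", "^", "s", "t", "r"],
    "flunk": ["f", "l", "^", "n", "k"],
    "buggy": ["b", "^", "g", "i"],
    "muggy": ["m", "^", "g", "i"],
}
MIN_MATCH = 2  # assumed: a word token needs two matching phonemes to be active


def top_down_support(trace):
    """Sum lexical feedback for every (slice, phoneme) pair in the trace.

    Every word is tried at every start position. A word token whose
    phonemes match the trace in at least MIN_MATCH slices feeds its match
    count back to each phoneme it predicts inside the trace.
    """
    support = defaultdict(float)
    for phonemes in LEXICON.values():
        for start in range(-len(phonemes) + 1, len(trace)):
            aligned = [
                (start + i, p)
                for i, p in enumerate(phonemes)
                if 0 <= start + i < len(trace)
            ]
            match = sum(1 for t, p in aligned if trace[t] == p)
            if match >= MIN_MATCH:
                for t, p in aligned:
                    support[(t, p)] += match
    return support


# "fluggy" with its vowel masked by noise ("?"), one phoneme per slice.
trace = ["f", "l", "?", "g", "i"]
support = top_down_support(trace)
candidates = {p: s for (t, p), s in support.items() if t == 2}
print(candidates)
```

No single word in this toy lexicon matches the whole input, yet all four active word tokens predict the same vowel in the masked slice, so their weak contributions add up there.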

This mechanism for using the lexicon to perceive non-words is intriguing,
because it suggests that some of the knowledge which linguists have assumed is
represented by rules might be located in the lexicon instead. Consider, for example,
phonotactic knowledge. Every language has certain sequences of sounds which
are permissible and others which are not. English has no word blik, but it might,
whereas most speakers of English would reject bnik as being unacceptable. One
might choose to conclude, therefore, that speakers have rules of the form

*#bn

(where the asterisk denotes ungrammaticality, and # indicates word beginning), or
more generally

*#[stop] [nasal]

But in fact, TRACE points to an alternative account for this behavior. If
perception of both words and nonwords is mediated by the lexicon, then to the
extent that a sequence of phonemes in a nonword occurs in real words, TRACE will
be able to sustain the pattern in the phoneme traces. If a sequence does not
exist, the pattern will still be present in the trace, but only by virtue of bottom-up
input, and weakly. TRACE predicts that phonotactic knowledge may not be
hard-and-fast in the fashion that rule-governed behavior should be. Because there are
some sequences which are uncommon, but which do occur in English (e.g., initial sf
clusters), listeners should be able to judge certain nonwords as more acceptable
than others; and this is in fact what happens (Greenberg & Jenkins, 1964).
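
The idea that graded acceptability could fall out of lexical support, rather than from a rule, can be illustrated with a deliberately crude Python sketch. The mini-lexicon, the use of letters as stand-ins for phonemes, and the scoring by shared initial cluster are assumptions made for this example only.

```python
LEXICON = [  # hypothetical mini-lexicon in rough phonemic spelling
    "blak", "blu", "blis", "blend", "brik",
    "sfir",  # "sphere": a rare initial sf cluster
    "stik", "snak", "slant",
]


def onset_support(nonword, lexicon=LEXICON, onset_len=2):
    """Count lexical entries that share the nonword's initial cluster.

    The count stands in for the top-down support that real words would
    lend to the same phonemes at the start of the trace.
    """
    onset = nonword[:onset_len]
    return sum(1 for word in lexicon if word.startswith(onset))


scores = {w: onset_support(w) for w in ["blik", "sfik", "bnik"]}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

Nothing in the sketch states that initial bn is forbidden; bnik simply finds no lexical support, while sfik finds a little and blik finds more, giving a graded ordering rather than a categorical one.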

Another advantage of TRACE is that early portions of words would still be
present in the trace and so would remain available for consideration and
modification. Ambiguous early portions of a word could be filled in retrospectively
once subsequent portions correctly specified the word. This would explain listeners'
tendencies to hear an [h] in the phrase _eel of the shoe (Warren & Sherman,
1974).

The TRACE model permits more ready extension of the interactive activation
approach to the perception of multi-word input. One can imagine the difficulties
which would be presented in COHORT given input which could be parsed either as
a single word, or several smaller words. Consider, for example, what would
happen if the system heard a string which could be interpreted either as sell your
light or cellulite. Assume that later input will disambiguate the parsing, and that for
the time being we wish to keep both possibilities active. Because words compete
strongly with one another in COHORT, the nodes for sell, your, light, and cellulite
will all be in active competition with one another. The system will have no way of
knowing that the competition is really only between the first three of these
words, as a group, and the last. In TRACE, words still compete, but the
competition can be directed toward the portion of the input they are attempting to
account for.
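
A minimal Python sketch of competition that is restricted to shared portions of the input is given below. The slice spans assigned to each word token and the use of overlap size as inhibition strength are hypothetical choices for the example, not part of the model as described.

```python
from itertools import combinations

# Hypothetical word tokens with the half-open slice spans they account for.
TOKENS = {
    "sell": (0, 3),
    "your": (3, 5),
    "light": (5, 8),
    "cellulite": (0, 8),
}


def overlap(span_a, span_b):
    """Number of time slices two half-open spans have in common."""
    return max(0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))


def competitors(tokens):
    """Return pairs of word tokens that inhibit each other.

    Two tokens compete only if they try to account for some of the same
    slices; the overlap size is used as the (assumed) inhibition strength.
    """
    pairs = {}
    for (a, span_a), (b, span_b) in combinations(tokens.items(), 2):
        shared = overlap(span_a, span_b)
        if shared > 0:
            pairs[(a, b)] = shared
    return pairs


pairs = competitors(TOKENS)
print(pairs)
```

In this sketch sell, your, and light do not inhibit one another, because they account for different stretches of the trace; each competes only with cellulite, which spans all of them.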

CONCLUSIONS


That speech perception is a complex behavior is a claim which is hardly novel
to us. What we hope to have accomplished here is to have shed some light on
exactly what it is about speech perception which makes it such a difficult task
to  model,  and  to  have  shown  why  interactive  activation  models  are  such  an 
appropriate  framework  for  speech  perception.  Our  basic  premise  is  that  attempts 
to  model  this  area  of  human  behavior  have  been  seriously  hampered  by  the  lack  of 
an  adequate  computational  framework. 

During  the  course  of  an  utterance  a  large  number  of  factors  interact  and 
shape  the  speech  stream.  While  there  may  be  some  acoustic  invariance  in  the 
signal, such invariance seems to be atypical and limited. It seems clear that
attempting  to  untangle  these  interactions  within  human  information  processing 
frameworks  which  resemble  von  Neumann  machines  is  a  formidable  task.  Those 
computer-based  systems  which  have  had  any  success,  such  as  HARPY,  have 
achieved  real-time  performance  at  the  expense  of  flexibility  and  extensibility, 
and  within  a  tightly  constrained  syntactic  and  lexical  domain.  We  do  not  wish  to 
downplay  the  importance  of  such  systems.  There  are  certainly  many  applications 
where they are very useful, and by illustrating how far the so-called
"engineering" approach can be pushed they provide an important theoretical
function as well.


However, we do not believe that the approach inherent in such systems will
ever lead to a speech understanding system which performs nearly as well as
humans, at anywhere near the rates we are accustomed to perceiving speech.
There is a fundamental flaw in the assumption that speech perception is carried
out in a processor which looks at all like a digital computer. Instead, a more
adequate model of speech perception assumes that perception is carried out over a
large number of neuron-like processing elements in which there are extensive
interactions. Such a model makes sense in terms of theoretical psychology; we
would argue that it will ultimately prove to be superior in practical terms as well.

In  this  chapter  we  have  described  the  computer  simulation  of  one  version 
(COHORT) of an interactive activation model of speech perception. This model
reproduces several phenomena which we know occur in human speech perception.
It provides an account for how knowledge can be accessed in parallel, and how a
large number of knowledge elements in a system can interact. It suggests one
method  by  which  some  aspects  of  the  encoding  due  to  coarticulation  might  be 
decoded.  And  it  demonstrates  the  paradoxical  feat  of  extracting  segments  from 
the  speech  stream  without  ever  doing  segmentation. 

COHORT has a number of defects. We have presented an outline of another
model, TRACE, which attempts to correct some of these defects. TRACE shows
that it is possible to integrate a dynamic working memory into an interactive
activation model, and that this not only provides a means for perceiving nonwords
but also shows that certain types of knowledge can be stored in the lexicon which
lead to what looks like rule-governed behavior.

What  we  have  said  so  far  about  TRACE  is  only  its  beginning.  For  one  thing, 
the  process  by  which  acoustic/phonetic  features  are  extracted  from  the  signal 
remains a challenging task for the future. And we have yet to specify how the
knowledge above the word level should come into play. One can imagine schemata
which correspond to phrases, and which have complex structures somewhat like
words,  but  there  are  doubtless  many  possibilities  to  explore. 

It is clear that a working model of speech perception which functions
anywhere nearly as well as humans do is a long way off. We do not claim that any of
the versions we present here is the right one, but we are encouraged by the
limited success of COHORT and the potential we see in TRACE. The basic
approach is promising.